Overview and Motivation

Yelp gets 142 million unique visitors a month and contains 2.1 million claimed local businesses as of the first quarter of 2015 (http://expandedramblings.com/index.php/yelp-statistics/). Consumers primarily use Yelp by filtering on categories, price range and ratings, then reading the recommended reviews to decide whether they would like to try a new business. Relying on this crowd-based recommendation system can be particularly useful when traveling to a new city, where local dining options are often a primary interest. But as an increasing number of businesses and consumers rely on Yelp, sifting through hundreds of reviews to find a good restaurant can be time consuming. We thought it would be useful for Yelp to have a Netflix-like recommendation system that analyzes a user’s history of businesses they have frequented, or reviews they have posted, in their home city and predicts which restaurants they are likely to visit when traveling to a new city.

Initial Questions

Our scientific goal is to learn which machine learning approach, or combination of approaches, is most effective for this task. We would like to infer the rating a user would give to a restaurant based on that user’s history; the higher the rating, the more likely the user is to enjoy the restaurant. A particular inferential challenge of this project is handling diverse data with unequal amounts of information on each subject: the quantity and quality of information vary greatly from user to user, and we want to explore various approaches to this challenge.

We hope to learn how home-city Yelp histories relate to Yelp activity in travel destination cities. This requires developing a judgment-based classification system for assigning users to a home city, since our dataset does not contain this information. Some questions we want to consider: How do travel dining preferences differ from home-city preferences? Are certain restaurants more popular with visitors than with locals (or vice versa)? Does a city have a signature genre of restaurants? How do the strengths of connections among the cities in the dataset differ? Our goal is to address these questions and synthesize the insights into a rank-based recommendation system for which restaurants a Yelp user would most enjoy in a new travel destination city.

Data

We got the data directly from the Yelp website by entering the Yelp Dataset Challenge (https://www.yelp.com/dataset_challenge). It has data on businesses in 6 cities in the US (Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas and Madison), 2 cities in Canada (Montreal and Waterloo) and 1 city each in Germany (Karlsruhe) and the U.K. (Edinburgh). This includes 2.2M reviews and 591K tips by 552K users for 77K businesses.

While most of the raw data is available at the link above, we also have some processed data on this Google Drive link (https://drive.google.com/drive/u/1/folders/0Bz4Yx2vJtaBVYzVkSVh4R21hX1U), as some of the files were too large to share on GitHub.

Analysis

We first want to load the datasets for businesses, users and reviews.

Businesses
#load required packages
library(jsonlite)  #fromJSON
library(dplyr)
library(tidyr)
library(readr)
library(caret)
library(glmnet)
library(ggplot2)
#load businesses
lines <- readLines("yelp_academic_dataset_business.json") 
business_full <- lapply(lines, fromJSON)
rm(lines)
business_full <- data.frame(do.call('rbind', business_full))
business_full <- business_full %>% 
  mutate(state = as.character(state), business_id = as.character(business_id))

City names in the dataset are recorded at the town/postal-code level. We want to group businesses at the metropolitan level so that we can classify tourists vs. locals; we use the metro areas that Yelp provided, keyed by each business's state field.

unique(business_full$state)
#some states seem to be misclassified
View(filter(business_full, state == "NM")[1:10,]) # Las Vegas in NM, remove 
View(filter(business_full, state == "TX")[1:10,]) # Dallas in Texas, remove 
View(filter(business_full, state == "EDH")[1:10,]) # EDH is Edinburgh 
View(filter(business_full, state == "MLN")[1:10,]) # MLN is Edinburgh
View(filter(business_full, state == "FIF")[1:10,]) # close to Edinburgh, classify as Edinburgh
View(filter(business_full, state == "ELN")[1:10,]) # close to Edinburgh, classify as Edinburgh
View(filter(business_full, state == "BW")[1:10,]) # Karlsruhe
View(filter(business_full, state == "RP")[1:10,]) # Karlsruhe
View(filter(business_full, state == "KHL")[1:10,]) # Edinburgh
View(filter(business_full, state == "NW")[1:10,]) # Karlsruhe
#group businesses by cities listed in yelp dataset
yelp_city <- matrix(c("PA", "Pittsburgh", 
                      "NC", "Charlotte", 
                      "SC", "Charlotte", 
                      "WI", "Madison", 
                      "IL", "Urbana-Champaign", 
                      "AZ", "Phoenix", 
                      "NV", "Las Vegas", 
                      "QC", "Montreal", 
                      "ON", "Waterloo", 
                      "EDH", "Edinburgh" ,
                      "MLN", "Edinburgh", 
                      "ELN", "Edinburgh", 
                      "FIF", "Edinburgh", 
                      "KHL", "Edinburgh", 
                      "BW", "Karlsruhe", 
                      "RP", "Karlsruhe", 
                      "NW", "Karlsruhe"), 
                    ncol = 2, byrow = TRUE)
colnames(yelp_city) <- c("state", "yelp_city")
yelp_city <- data.frame(yelp_city)
#join with business
business_full <- left_join(business_full, yelp_city, by = "state")
business_full <- business_full %>%
  filter(!is.na(yelp_city)) %>%
  mutate(state = as.character(state))
#some businesses are in more than one category
#unpack the categories: repeat each business_id once per category
business_categories <- data.frame(
  business_id = rep(business_full$business_id, lengths(business_full$categories)),
  category = unlist(business_full$categories))
#spread to get one row per business with columns for each category
#this turns the dataset into wide format with a 0/1 indicator for each category
business_categories_wide <- data.frame(business_categories, 1)
business_categories_wide <- business_categories_wide %>% spread(category, X1)
business_categories_wide[is.na(business_categories_wide)] <- 0
#write.csv(business_categories, file = "business_categories.csv")
#get the categories as a list
cats <- data.frame(unlist(business_full$categories))
colnames(cats) <- c("categories")
#count how many times each category appears
#arrange by count
cats <- cats %>%
  group_by(categories) %>%
  summarise(n = n()) %>%
  ungroup %>%
  arrange(desc(n))
#pick the 100 most frequent categories
high_cats <- cats$categories[1:100]
high_cats <- as.character(high_cats)
#join with business data frame
business_wide <-  select(business_full, business_id, yelp_city) %>%
  left_join(select(business_categories_wide, business_id, one_of(high_cats)))
Users

We first convert the users dataset into a .csv file so that it loads faster on subsequent uses.

lines <- readLines("yelp_academic_dataset_user.json") 
users_full <- lapply(lines, fromJSON)
rm(lines)
users_full <- data.frame(do.call('rbind', users_full))
users_full <- apply(users_full, 2, as.character)
users_full <- data.frame(users_full)
write.csv(users_full, file = "users_full.csv")
#load users
users <- read_csv("users_full.csv")
Reviews

The reviews dataset was too large to convert to .csv in one pass, so we process it in chunks.

#convert a chunk of JSON lines to a data frame, set column types and write to csv
chunky <- function(lines, file_name) {
  jsoned <- lapply(lines, fromJSON)
  df <- data.frame(do.call('rbind', jsoned))
  df <- df[, -c(1, 7)]  #drop two columns we don't need
  df$user_id <- as.character(df$user_id)
  df$review_id <- as.character(df$review_id)
  df$stars <- as.numeric(df$stars)
  df$date <- as.character(df$date)
  df$text <- as.character(df$text)
  df$business_id <- as.character(df$business_id)
  write_csv(df, file_name)
  return("done!")
}

#process the raw JSON lines in chunks of 250,000
chunk_wrapper <- function(lines) {
  n <- length(lines)
  for(i in 1:ceiling(n/250000)) {
    start_row <- 1 + (i-1)*250000
    end_row <- i*250000
    if(i == ceiling(n/250000)) { end_row <- n }
    chunky(lines[start_row:end_row], paste0("reviews_", start_row, "_", end_row, ".csv"))
  }
  return("really done!")
}
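The combining step itself can be sketched as follows; this assumes the reviews_*.csv chunk files written by chunk_wrapper() sit in the working directory:

```r
library(readr)

#collect the per-chunk CSVs written by chunk_wrapper() and stack them
chunk_files <- list.files(pattern = "^reviews_[0-9]+_[0-9]+\\.csv$")
reviews_full <- do.call(rbind, lapply(chunk_files, read_csv))
write_csv(reviews_full, "reviews_full.csv")
```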
#combine the chunked files to get reviews_full.csv, then load it

reviews_full <- read_csv("reviews_full.csv")
#save a copy without the review text so the file loads quickly
reviews <- select(reviews_full, -text)
write.csv(reviews, file = "reviews_notext.csv")
#load reviews dataset
reviews <- read_csv("reviews_notext.csv")[-1]

Predicting Tourists vs. Locals

Our dataset does not include each user’s home city. In this section, we attempt to classify each user-city combination as tourist vs. local using the information we do have.

To this end, we create variables for each user, such as the number of places reviewed in each city (locals have more reviews in their home city), the standard deviation of review dates in each city (locals will have reviewed places in their home city over a longer period of time) and the number of reviews in each category (some categories, like “Home Services”, might be reviewed more by locals).

Join Reviews, Businesses and Users and Create Variables
#join reviews and businesses
joined <- left_join(select(reviews, user_id, business_id, date), business_wide, by = "business_id")
#add number of categories reviewed in each city by each user
sum_categories <- select(joined, -business_id, -date) %>%
  group_by(user_id, yelp_city) %>%
  summarize_each(funs(sum)) %>% 
  ungroup
#get the number of places reviewed in each city and the standard deviation of the dates
joined <- select(joined, user_id, yelp_city, date) %>%
  left_join(select(users, user_id, review_count)) %>%
  group_by(user_id, yelp_city) %>%
  summarize(review_city = n(), sd_dates = sd(as.Date(date)))
## Joining by: "user_id"
#join to the table with summed categories
joined <- left_join(joined, sum_categories)
## Joining by: c("user_id", "yelp_city")
#join to users table
joined <- joined %>%
  left_join(select(users, user_id, review_count)) %>%
  ungroup
## Joining by: "user_id"
#reorder columns and rename review_count for users 
joined <- joined[,c(1:3, 105, 4:104)]
colnames(joined)[4] <- "review_total"
#set SD dates that are NA's to 0
joined$sd_dates[is.na(joined$sd_dates)] <- 0
#remove users that have 10 or fewer reviews
grouped_users <- joined %>% filter(review_total > 10)
#check if there are still users that have only reviewed in one city
grouped_users %>% ungroup %>% summarise(n = sum(review_city == review_total))
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1  2107
#remove users that have only reviewed in one city
#because 100% of their reviews being in one city doesn't help with classifying them
grouped_users <- grouped_users %>%
  ungroup %>%
  filter(review_city != review_total)
#create a column with the proportion of each user's reviews that are in the city
grouped_users <- grouped_users %>% 
  mutate(percent_city = review_city/review_total) %>% na.omit

Now we assign tourist or local labels to the users we can be fairly certain about, build a regression model on them, and eventually apply the model to the rest of the dataset.

To assign the tourist/local designation, we use conservative thresholds on the percentage of a user’s total reviews that fall in a particular city. We set 0.1 as the threshold for being a tourist, i.e., users need 10% or less of their reviews in a city to be considered tourists there, and 0.8 as the threshold for being a local, i.e., users need 80% or more of their reviews in a city to be considered locals there.

#create tourist vs local column
grouped_users <- grouped_users %>% 
  mutate(tourist = ifelse(percent_city < 0.1, 1, ifelse(percent_city > 0.8, 0, NA)))
#keep only the ones that have been classified as tourists or locals
grouped_users_tl <- grouped_users %>% filter(!is.na(tourist))
grouped_users_tl %>% group_by(tourist) %>% summarise(num = n(), review_count = mean(review_total), 
                                                     city_count = mean(review_city))
## Source: local data frame [2 x 4]
## 
##   tourist    num review_count city_count
##     (dbl)  (int)        (dbl)      (dbl)
## 1       0  11540     31.96187  28.798527
## 2       1 145146    109.24637   2.524127

We fit a logistic regression model with and without a LASSO penalty. We want to check whether the LASSO penalty prevents the model from overfitting to the training set by regularizing the contributions of some of the parameters.

Logistic Regression
set.seed(202)
#create train and test dataset
inTrain <- createDataPartition(y = grouped_users_tl$tourist, p=0.5)
train_set_tl <- slice(select(grouped_users_tl, -user_id, -yelp_city, -review_city, -review_total, -percent_city),
                   inTrain$Resample1)
test_set_tl <- slice(select(grouped_users_tl, -user_id, -yelp_city, -review_city, -review_total, -percent_city),
                  -inTrain$Resample1)
#run glm model
glm.tourist <- glm(tourist ~ ., family = binomial, data = train_set_tl)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.tourist)
## 
## Call:
## glm(formula = tourist ~ ., family = binomial, data = train_set_tl)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904   0.0431   0.0805   0.1035   8.4904  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     5.9961532  0.0635861  94.300  < 2e-16 ***
## sd_dates                       -0.0005344  0.0001606  -3.328 0.000874 ***
## Restaurants                    -0.3559045  0.0273510 -13.012  < 2e-16 ***
## Shopping                       -0.4466946  0.0704644  -6.339 2.31e-10 ***
## Food                           -0.0764235  0.0576321  -1.326 0.184820    
## `Beauty & Spas`                -1.3396638  0.1477410  -9.068  < 2e-16 ***
## `Health & Medical`             -1.4586177  0.1787024  -8.162 3.29e-16 ***
## Nightlife                       0.1283229  0.0865783   1.482 0.138298    
## `Home Services`                -1.7926465  0.1469037 -12.203  < 2e-16 ***
## Bars                           -0.1505227  0.0974534  -1.545 0.122453    
## Automotive                     -1.8088780  0.1574875 -11.486  < 2e-16 ***
## `Local Services`               -0.9933200  0.1370997  -7.245 4.32e-13 ***
## `Active Life`                  -0.1939946  0.0736374  -2.634 0.008427 ** 
## Fashion                         0.0286625  0.1847261   0.155 0.876694    
## `Event Planning & Services`    -0.5447381  0.1343232  -4.055 5.00e-05 ***
## `Fast Food`                    -0.2417276  0.0723695  -3.340 0.000837 ***
## Pizza                          -0.4444443  0.0541094  -8.214  < 2e-16 ***
## Mexican                        -0.4854461  0.0458183 -10.595  < 2e-16 ***
## `Hotels & Travel`               1.7877953  0.1395213  12.814  < 2e-16 ***
## `American (Traditional)`        0.2273066  0.0450313   5.048 4.47e-07 ***
## Sandwiches                      0.1624486  0.0583723   2.783 0.005386 ** 
## `Arts & Entertainment`         -0.0277359  0.0583915  -0.475 0.634788    
## `Coffee & Tea`                 -0.3578845  0.0695350  -5.147 2.65e-07 ***
## `Hair Salons`                  -0.1625889  0.1569397  -1.036 0.300204    
## Italian                        -0.0809937  0.0568670  -1.424 0.154370    
## Burgers                        -0.2321842  0.0562344  -4.129 3.65e-05 ***
## `Auto Repair`                  -0.3034827  0.2232134  -1.360 0.173954    
## Doctors                        -0.5740310  0.2906361  -1.975 0.048259 *  
## `Nail Salons`                  -0.5166313  0.1664683  -3.103 0.001913 ** 
## Chinese                        -0.2399499  0.0569412  -4.214 2.51e-05 ***
## `American (New)`                0.0406571  0.0396477   1.025 0.305146    
## `Home & Garden`                -0.2086535  0.2085807  -1.000 0.317142    
## Pets                           -1.4473477  0.3629759  -3.987 6.68e-05 ***
## `Fitness & Instruction`        -1.1791621  0.2267109  -5.201 1.98e-07 ***
## Hotels                          0.0701798  0.2059056   0.341 0.733228    
## Grocery                        -0.5059714  0.1116880  -4.530 5.89e-06 ***
## `Real Estate`                   0.5764424  0.4740113   1.216 0.223949    
## `Breakfast & Brunch`           -0.0321472  0.0450359  -0.714 0.475343    
## Dentists                        0.0695466  0.5317114   0.131 0.895935    
## `Specialty Food`                0.0915576  0.1090006   0.840 0.400923    
## `Women's Clothing`             -0.0410449  0.2423902  -0.169 0.865534    
## Bakeries                       -0.0647634  0.0837086  -0.774 0.439122    
## `Professional Services`        -0.7543446  0.2846755  -2.650 0.008053 ** 
## `Ice Cream & Frozen Yogurt`    -0.3039911  0.1028826  -2.955 0.003129 ** 
## Cafes                          -0.0865165  0.0789680  -1.096 0.273258    
## `Financial Services`           -2.2000231  0.5215550  -4.218 2.46e-05 ***
## Pubs                            0.0089577  0.0777705   0.115 0.908302    
## `Pet Services`                 -1.0749644  0.4890239  -2.198 0.027936 *  
## Japanese                        0.0499204  0.0667985   0.747 0.454866    
## `General Dentistry`            -0.3002220  0.6420350  -0.468 0.640064    
## `Sports Bars`                  -0.5210206  0.0928551  -5.611 2.01e-08 ***
## `Sushi Bars`                   -0.4623925  0.0690043  -6.701 2.07e-11 ***
## Apartments                     -0.9298809  0.5508328  -1.688 0.091385 .  
## Education                      -0.1402660  0.2952759  -0.475 0.634763    
## `Convenience Stores`           -0.1537277  0.2431506  -0.632 0.527235    
## Gyms                            0.1344937  0.3031612   0.444 0.657305    
## `Sporting Goods`               -0.0669325  0.1965260  -0.341 0.733421    
## `Skin Care`                    -0.3530776  0.2272197  -1.554 0.120207    
## `Cosmetics & Beauty Supply`     1.0494256  0.2653154   3.955 7.64e-05 ***
## Desserts                        0.5921127  0.0864382   6.850 7.38e-12 ***
## `Chicken Wings`                -0.4010382  0.1179227  -3.401 0.000672 ***
## Delis                          -0.0589924  0.1035867  -0.569 0.569018    
## `Day Spas`                      0.8371774  0.1647131   5.083 3.72e-07 ***
## `Hair Removal`                  0.0703769  0.2031577   0.346 0.729031    
## Seafood                        -0.0642442  0.0678862  -0.946 0.343970    
## Drugstores                      0.2839653  0.2507534   1.132 0.257446    
## `Men's Clothing`                0.5906090  0.2877055   2.053 0.040090 *  
## Massage                         0.9645520  0.2165619   4.454 8.43e-06 ***
## Tires                          -0.3877702  0.3017994  -1.285 0.198841    
## Accessories                     0.4028619  0.2991755   1.347 0.178118    
## `Flowers & Gifts`              -0.0667495  0.2526263  -0.264 0.791609    
## Lounges                         0.0033027  0.0879820   0.038 0.970056    
## Steakhouses                     0.1541856  0.0638148   2.416 0.015686 *  
## Jewelry                        -0.5788581  0.2351399  -2.462 0.013826 *  
## `Books, Mags, Music & Video`    0.6927312  0.1462881   4.735 2.19e-06 ***
## `Oil Change Stations`          -0.2475667  0.3008700  -0.823 0.410601    
## `Arts & Crafts`                 0.5610681  0.2686067   2.089 0.036725 *  
## `Department Stores`             1.0071455  0.2299165   4.380 1.18e-05 ***
## Mediterranean                  -0.2991872  0.0870239  -3.438 0.000586 ***
## `Dry Cleaning & Laundry`       -0.4472951  0.3625446  -1.234 0.217290    
## Barbeque                       -0.3156261  0.0795624  -3.967 7.28e-05 ***
## `Gas & Service Stations`        2.4907501  0.3512135   7.092 1.32e-12 ***
## Barbers                        -0.6382606  0.2854472  -2.236 0.025352 *  
## Trainers                        0.1986394  0.3304376   0.601 0.547746    
## `Asian Fusion`                  0.1721267  0.0832548   2.067 0.038690 *  
## `Banks & Credit Unions`         1.4511201  0.6336881   2.290 0.022024 *  
## `Public Services & Government`  0.6302393  0.2481155   2.540 0.011082 *  
## Thai                           -0.1409520  0.0773182  -1.823 0.068301 .  
## `Beer, Wine & Spirits`          0.6774906  0.1526553   4.438 9.08e-06 ***
## `Furniture Stores`              0.4865864  0.3202224   1.520 0.128630    
## `Pet Groomers`                  1.2077524  0.4582890   2.635 0.008405 ** 
## Veterinarians                  -0.5742026  0.4085322  -1.406 0.159865    
## `Auto Parts & Supplies`         0.6821131  0.3307159   2.063 0.039157 *  
## `Venues & Event Spaces`         1.0030304  0.1970093   5.091 3.56e-07 ***
## `Cosmetic Dentists`            -0.5274160  0.5938378  -0.888 0.374461    
## French                          0.4656792  0.0953207   4.885 1.03e-06 ***
## `Performing Arts`              -0.0650986  0.1071084  -0.608 0.543332    
## Optometrists                   -0.0753107  0.4370995  -0.172 0.863205    
## `Shoe Stores`                   0.3869707  0.3394732   1.140 0.254321    
## `Juice Bars & Smoothies`       -1.0022978  0.1310382  -7.649 2.03e-14 ***
## Indian                         -0.4971457  0.1151667  -4.317 1.58e-05 ***
## Buffets                         0.6656441  0.0740122   8.994  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41119.3  on 78342  degrees of freedom
## Residual deviance:  6957.9  on 78241  degrees of freedom
## AIC: 7161.9
## 
## Number of Fisher Scoring iterations: 9
#predict probabilities and classify at a 0.5 cutoff on the train and test sets
glm_train_prob <- round(predict(glm.tourist, train_set_tl, type = "response"))
glm_test_prob <- round(predict(glm.tourist, test_set_tl, type = "response"))
#Confusion Matrix for train set
tab <- table(glm_train_prob, train_set_tl$tourist)
conf_matrix <- confusionMatrix(tab)
conf_matrix$table
##               
## glm_train_prob     0     1
##              0  5041   350
##              1   712 72240
conf_matrix$overall["Accuracy"]
##  Accuracy 
## 0.9864442
#Confusion Matrix for test set
tab <- table(glm_test_prob, test_set_tl$tourist)
conf_matrix <- confusionMatrix(tab)
conf_matrix$table
##              
## glm_test_prob     0     1
##             0  5106   333
##             1   681 72223
conf_matrix$overall["Accuracy"]
##  Accuracy 
## 0.9870569
LASSO
#glmnet needs covariates in a matrix form
x_train <- as.matrix(train_set_tl %>% select(-tourist)) 
x_test <- as.matrix(test_set_tl %>% select(-tourist)) 
#run LASSO without tuning
tourist.lasso <- glmnet(x_train, y = as.factor(train_set_tl$tourist), alpha = 1, family = 'binomial')
plot(tourist.lasso,xvar="lambda")
title(main = "Coefficients at different Log Lambda", cex.main = 1, line = 3)

#use cross validation to find optimal value of lambda (penalty parameter)
cv.lasso <- cv.glmnet(x_train, y=train_set_tl$tourist, alpha=1)
plot(cv.lasso)
title( main = "CV MSE vs Log Lambda", cex.main = 1, line = 3)

best_lambda_l <- cv.lasso$lambda.min
#run LASSO with the best_lambda
tourist.lasso_best <- glmnet(x_train, as.factor(train_set_tl$tourist),
                             alpha = 1, family = 'binomial', lambda = best_lambda_l)

#test accuracy on train and test set
lasso_train_prob <- round(predict(tourist.lasso_best, x_train, type = "response"))
lasso_test_prob <- round(predict(tourist.lasso_best, x_test, type = "response"))

#Confusion Matrix for train set
tab <- table(lasso_train_prob, train_set_tl$tourist)
conf_matrix <- confusionMatrix(tab)
conf_matrix$table
##                 
## lasso_train_prob     0     1
##                0  4923   327
##                1   830 72263
conf_matrix$overall["Accuracy"]
##  Accuracy 
## 0.9852316
#Confusion Matrix for test set
tab <- table(lasso_test_prob, test_set_tl$tourist)
conf_matrix <- confusionMatrix(tab)
conf_matrix$table
##                
## lasso_test_prob     0     1
##               0  4955   332
##               1   832 72224
conf_matrix$overall["Accuracy"]
##  Accuracy 
## 0.9851423

The accuracies are similar for the two methods, and the LASSO shrinks most of the coefficients that were non-significant in the GLM. We decided to go with the GLM model without the LASSO penalty based on its slightly higher accuracy on the test dataset.

However, we can use the LASSO coefficients to see which predictors were important in identifying tourists vs. locals. Since glmnet standardizes the predictors by default, the absolute magnitude of the coefficients gives an idea of variable importance.

#Coefficients of lasso betas
lasso_betas <- data.frame(cbind(row.names(as.matrix(tourist.lasso_best$beta)),
                                as.matrix(tourist.lasso_best$beta))) %>%
  filter(s0 !=0) %>% rename(cat = V1) %>% rename(val = s0) %>% 
  mutate(val = as.numeric(as.character(val)), cat = as.factor(cat))
lasso_betas %>% 
  mutate(local = as.factor(ifelse(val>0,0,1))) %>% 
  ggplot(aes(x=reorder(cat,val),y=val)) + geom_bar(stat="identity", aes(fill=local)) + 
  coord_flip() + 
  labs(y="Coefficient Value", 
       x="Business Category",title="Business Category Review Count Predictors for classifying locals and tourists")

Here we can see that reviewing categories like Hotels & Travel makes a user more likely to be a tourist, whereas categories like Home Services and Pets make them more likely to be a local.

Going back to the GLM results, to make predictions on the full dataset we apply conservative thresholds to the regression output as well: we classify a user as a tourist only if the predicted probability of being a tourist is greater than 0.6, and as a local only if it is less than 0.4.

Predictions on Full Dataset
#apply logistic regression model to the full dataset
grouped_users_pred <- grouped_users %>%   
  mutate(pred_tourist_prob = predict(glm.tourist, grouped_users, type = "response"))
#classify as tourist, local or unsure (if between 0.4 and 0.6)
grouped_users_pred <- grouped_users_pred %>%   
  mutate(pred_tourist = ifelse(pred_tourist_prob > 0.6, "tourist", 
                               ifelse(pred_tourist_prob <0.4, "local", "unsure")))
#"Confusion Matrix"
grouped_users_pred %>% 
  filter(!is.na(pred_tourist)) %>%
  mutate(tourist = ifelse(tourist == 1, "tourist", 
                          ifelse(tourist == 0, "local", NA))) %>%
  filter(!is.na(tourist)) %>%
  group_by(tourist) %>%
  summarise(Correct = sum(tourist == pred_tourist),
            Incorrect= sum(tourist != pred_tourist))
## Source: local data frame [2 x 3]
## 
##   tourist Correct Incorrect
##     (chr)   (int)     (int)
## 1   local    9873      1667
## 2 tourist  144342       804
#check percent of reviews in the city and the total number of reviews for users in each category
grouped_users_pred %>% group_by(pred_tourist) %>% summarise(percent_city = mean(percent_city),
                                                            review_total = mean(review_total))
## Source: local data frame [3 x 3]
## 
##   pred_tourist percent_city review_total
##          (chr)        (dbl)        (dbl)
## 1        local   0.67934736     61.82995
## 2      tourist   0.09171367     88.67800
## 3       unsure   0.52425656     46.90095
#plot histogram
grouped_users_pred %>% 
  ggplot(aes(x = pred_tourist_prob)) + 
  geom_histogram() + 
  scale_y_sqrt() + 
  geom_vline(xintercept = .4, colour = "red", size = 1) + 
  geom_vline(xintercept = .6, colour = "red", size = 1) + 
  labs(x = "Predicted probability of being a tourist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Caveats of the Tourist Classification Model

This model is built on a forced initial classification (based on the <0.1 and >0.8 thresholds). While those thresholds are conservative and can be justified, there may be issues because we have no way of verifying them.

For instance, our initial classification might be affected by each user’s total number of reviews. We see that the users classified as tourists generally have more total reviews; some of them might be frequent travelers who review in their home cities just as often as in their travel cities.

Also, the confusion matrices show that our models are skewed toward tourists, since we have many more tourists than locals in the dataset (tourists can come from anywhere in the world, while locals must be from one of the 10 cities in the dataset). So although the model has a high overall accuracy, the error rate is much larger among the locals.
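To quantify this imbalance, the per-class error rates can be computed from the test-set confusion matrix above (numbers copied from that output):

```r
#test-set confusion matrix: rows = predicted (local, tourist), columns = actual
cm <- matrix(c(5106, 681, 333, 72223), nrow = 2,
             dimnames = list(pred = c("local", "tourist"),
                             actual = c("local", "tourist")))
#fraction of actual locals predicted as tourists, and vice versa
local_error <- cm["tourist", "local"] / sum(cm[, "local"])     #about 11.8%
tourist_error <- cm["local", "tourist"] / sum(cm[, "tourist"]) #about 0.46%
```

So even though overall accuracy is near 99%, roughly one in eight locals is misclassified.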

We ran some spot checks on the model by reading actual reviews from users classified in Montreal and Las Vegas.

#montreal locals and tourists
montreal_local <- grouped_users_pred %>% filter(yelp_city == "Montreal" & pred_tourist == "local") 
montreal_tourist <- grouped_users_pred %>% filter(yelp_city == "Montreal" & pred_tourist == "tourist") 

vegas_local <- grouped_users_pred %>% filter(yelp_city == "Las Vegas" & pred_tourist == "local") 
vegas_tourist <- grouped_users_pred %>% filter(yelp_city == "Las Vegas" & pred_tourist == "tourist") 
#MONTREAL LOCALS
smpl <- sample(seq(1, nrow(montreal_local), 1), 10)

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[1]], ],
                          select(business_full, yelp_city, business_id))
#can't tell, in French

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[2]], ],
                          select(business_full, yelp_city, business_id))
#might be local

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[3]], ],
                          select(business_full, yelp_city, business_id))
#probably local

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[4]], ],
                          select(business_full, yelp_city, business_id))
#probably local

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[5]], ],
                          select(business_full, yelp_city, business_id))
#local

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[6]], ],
                          select(business_full, yelp_city, business_id))
#French

test_reviews <- left_join(reviews[reviews_full$user_id == montreal_local$user_id[smpl[7]], ],
                          select(business_full, yelp_city, business_id))
#French, probably local
#MONTEREAL TOURISTS
smpl  <- sample(seq(1, nrow(montreal_tourist), 1), 10)

test_reviews <- left_join(reviews[reviews_full$user_id==montreal_tourist$user_id[smpl[1]],],select(business,yelp_city,
                                                                                              business_id))
#incorrect

test_reviews <- left_join(reviews[reviews_full$user_id==montreal_tourist$user_id[smpl[2]],],select(business,yelp_city,
                                                                                              business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==montreal_tourist$user_id[smpl[3]],],select(business,yelp_city,
                                                                                              business_id))
#probably correct

test_reviews <- left_join(reviews[reviews_full$user_id==montreal_tourist$user_id[smpl[4]],],select(business,yelp_city,
                                                                                              business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==montreal_tourist$user_id[smpl[5]],],select(business,yelp_city,
                                                                                              business_id))
#seems correct, local in Champaign, frequently visits Montreal; user ID CX4GcrCCnzfNyAl0cPqGbg

test_reviews <- left_join(reviews[reviews_full$user_id==montreal_tourist$user_id[smpl[6]],],select(business,yelp_city,
                                                                                              business_id))
#correct
#VEGAS LOCALS
smpl  <- sample(seq(1, nrow(vegas_local), 1), 10)

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_local$user_id[smpl[1]],],select(business,yelp_city,
                                                                                         business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_local$user_id[smpl[2]],],select(business,yelp_city,
                                                                                         business_id))
#seems correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_local$user_id[smpl[3]],],select(business,yelp_city,
                                                                                         business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_local$user_id[smpl[4]],],select(business,yelp_city,
                                                                                         business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_local$user_id[smpl[5]],],select(business,yelp_city,
                                                                                         business_id))
#correct
#VEGAS TOURISTS
smpl  <- sample(seq(1, nrow(vegas_tourist), 1), 10)

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[1]],],select(business,yelp_city,
                                                                                         business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[2]],],select(business,yelp_city,
                                                                                         business_id))
#seems correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[3]],],select(business,yelp_city,
                                                                                         business_id))
#correct

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[4]],],select(business,yelp_city,
                                                                                         business_id))
#correct, review in German

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[5]],],select(business,yelp_city,
                                                                                         business_id))
#seems incorrect

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[6]],],select(business,yelp_city,
                                                                                         business_id))
#seems incorrect

test_reviews <- left_join(reviews[reviews_full$user_id==vegas_tourist$user_id[smpl[7]],],select(business,yelp_city,
                                                                                         business_id))
#seems correct
Filtering Users Classified as Local in at Least One City and as a Tourist in at Least One Other
#save users that have been classified
users_trstvsloc <- grouped_users_pred %>% select(user_id, yelp_city, pred_tourist)
#write.csv(users_trstvsloc, file = "user_touristvlocal.csv")
#wide format with cities as columns
users_trstvsloc_spread <- users_trstvsloc %>% spread(yelp_city, pred_tourist)
#write.csv(users_trstvsloc_spread, file = "user_touristvlocal_spread.csv")
#select users that have been classified as local in at least one city and tourist in at least one
yc_locals_trst <- users_trstvsloc %>% 
  group_by(user_id) %>%
  mutate(n=n()) %>%
  ungroup %>%
  filter(n>1) %>%
  filter(pred_tourist == "local")
#some users are locals in more than 1 city
twice_locals <- yc_locals_trst %>% group_by(user_id) %>% summarise(n = n()) %>% filter(n>1)
#check how many cities they've been classified in 
twice_locals <- users_trstvsloc %>%
  filter(user_id %in% twice_locals$user_id) %>% 
  group_by(user_id) %>% mutate(n = n()) %>% unique
#people that are not classified as tourists anywhere
twice_locals_only <- unique(filter(twice_locals, n<3)$user_id)
#filter out people that are not classified as tourists anywhere
yc_locals_trst <- yc_locals_trst %>% filter(!(user_id %in% twice_locals_only))
#wide version of the same data with cities as columns
locals_tourist <- right_join(users_trstvsloc_spread, select(yc_locals_trst, user_id))
## Joining by: "user_id"

This is the dataset we will use to build our model and make predictions, because it contains information on each user's home-city and travel-city preferences.

Predicting Ratings on Yelp for Tourists

We will now build our model to make predictions for yelpers.

Business Effect

The business effect is the average star rating a business receives across the reviews in our dataset.
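The idea can be sketched on toy data before the full pipeline below (the business IDs and star ratings here are invented for illustration):

```r
# Sketch of the business effect: the per-business mean star rating
# (business IDs and ratings are made up for this example).
toy_reviews <- data.frame(
  business_id = c("b1", "b1", "b2", "b2", "b2"),
  stars       = c(5, 3, 4, 4, 2)
)
business_effect <- aggregate(stars ~ business_id, data = toy_reviews, FUN = mean)
business_effect
# b1 averages 4, b2 averages 10/3
```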

#filter just the dining businesses
business_food <- business_full %>%
  filter(grepl("Restaurants", as.character(categories)) | 
           grepl("Cafe", as.character(categories)))

set_users <- unique(yc_locals_trst$user_id)
mu <- mean(reviews %>% filter(!(user_id %in% set_users)) %>% .$stars)
business2 <- reviews %>% filter(!(user_id %in% set_users)) %>% group_by(business_id) %>%
  summarise(business_stars = mean(stars), b_n = n()) %>%
  left_join(select(business_food, business_id, yelp_city, attributes), by = "business_id")
Tourist Effect

The tourist effect is the average star rating each business receives from reviewers classified as tourists.

business2 <- reviews %>%
  filter(!(user_id %in% set_users)) %>%
  left_join(select(business2, business_id, yelp_city)) %>%
  left_join(users_trstvsloc, by = c("user_id", "yelp_city")) %>%
  filter(pred_tourist == "tourist") %>% group_by(business_id) %>%
  summarise(tourist_stars = mean(stars), t_n = n()) %>%
  left_join(business2, ., by = "business_id")
#if there weren't any tourists who reviewed the place, set the tourist stars equal to the mean for the business
business2$tourist_stars[is.na(business2$tourist_stars)] <- business2$business_stars[is.na(business2$tourist_stars)]
User Local Effects

This is the average rating each user gives to businesses in their home city.

user_effects <- select(reviews,review_id,user_id,business_id,stars) %>% 
  filter(user_id %in% set_users) %>% 
  left_join(select(business2,business_id,yelp_city)) %>% 
  left_join(users_trstvsloc,by=c("user_id","yelp_city")) %>% 
  filter(pred_tourist=="local") %>% 
  group_by(user_id) %>% 
  summarise(user_local_stars = mean(stars),user_local_n=n())
User Local Ratings by business category

This is each user's average rating by business category in their home city.

#The user category effect takes the average ratings for each category attribute from the user's home city and connects these averages to the category attributes of the business in the review that is being predicted.
user_categories <- select(reviews,review_id,user_id,business_id,stars) %>% 
  filter(user_id %in% set_users) %>% 
  left_join(select(business2,business_id,yelp_city)) %>% 
  left_join(users_trstvsloc,by=c("user_id","yelp_city")) %>%  
  filter(pred_tourist=="local") %>%  
  inner_join(business_categories,.) %>% 
  filter(!(category %in% c("Restaurants","Cafe","Food"))) %>% 
  group_by(user_id,category) %>% 
  summarise(user_category_stars = mean(stars),user_cat_n=n())
User Local Ratings by business attributes

This is each user's average rating by business attribute in their home city.

#function to build a table in which each attribute of a business gets its own row
make_attributes <- function(df) {
  return_df <- NULL
  for(i in 1:nrow(df)) {
    atr_i <- unlist(df$attributes[i])
    if(!is.null(atr_i)) {return_df <- rbind(return_df,cbind(df$business_id[i],names(atr_i),matrix(atr_i)))}
  }
  return_df <- data.frame(return_df)
  names(return_df) <- c("business_id","attribute","value")
  return(return_df)
}
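To see what this helper returns, here is a toy run (the function is restated so the snippet is standalone; the business IDs and attributes are invented):

```r
# Toy run of make_attributes: flatten a list-column of attributes into
# one row per (business, attribute) pair. IDs and attributes are made up.
make_attributes <- function(df) {
  return_df <- NULL
  for (i in 1:nrow(df)) {
    atr_i <- unlist(df$attributes[i])
    if (!is.null(atr_i)) {
      return_df <- rbind(return_df, cbind(df$business_id[i], names(atr_i), matrix(atr_i)))
    }
  }
  return_df <- data.frame(return_df)
  names(return_df) <- c("business_id", "attribute", "value")
  return(return_df)
}

toy <- data.frame(business_id = c("b1", "b2"), stringsAsFactors = FALSE)
toy$attributes <- list(list(WiFi = "free", `Price Range` = "2"), NULL)
res <- make_attributes(toy)
res
# two rows, both for "b1" (one per attribute); "b2" has no attributes and is dropped
```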

#expand attributes into rows for the businesses reviewed by the selected users
business_att <- select(reviews, -text) %>%
  filter(user_id %in% set_users) %>%
  left_join(select(business2, business_id, attributes)) %>%
  select(business_id, attributes) %>%
  make_attributes
write.csv(business_att, file = "business_att.csv")
business_attributes <- read.csv("business_att.csv")
business_attributes <- business_attributes %>% filter(value != FALSE)
#compute the average for each business attribute by user
user_attributes <- select(reviews,review_id,user_id,business_id,stars) %>% 
  filter(user_id %in% set_users) %>% 
  left_join(select(business2,business_id,yelp_city)) %>% 
  left_join(users_trstvsloc,by=c("user_id","yelp_city")) %>%  
  filter(pred_tourist=="local") %>%  
  inner_join(business_attributes,.) %>% 
  group_by(user_id,attribute,value) %>% 
  summarise(user_attribute_stars = mean(stars),user_at_n=n())
User Ratings by business attribute (ambience)
#The user ambience effect takes the average ratings for each ambience attribute from the user's home city and connects these averages to the ambience attributes of the business in the review that is being predicted.
business_ambience <- select(business_attributes,-value) %>% filter(grepl("Ambience",attribute))
user_ambience <- select(user_attributes,-value) %>% filter(grepl("Ambience",attribute))

ambience_reviews  <- reviews %>%
  select(.,review_id,user_id,business_id,stars) %>%
  filter(user_id %in% set_users) %>%
  inner_join(business_ambience,.,by="business_id") %>%
  left_join(.,user_ambience, by=c("user_id","attribute")) %>%
  mutate(user_ambience_stars = ifelse(is.na(user_attribute_stars),0,user_attribute_stars),
         user_ambience_n = ifelse(is.na(user_at_n),0,user_at_n)) %>%
  group_by(review_id) %>%
  summarise(user_ambience_stars=mean(user_ambience_stars),user_ambience_n=sum(user_ambience_n))
User Ratings by business attribute (“good for”)
#The user 'good for' effect takes the average ratings for each 'good for' attribute from the user's home city and connects these averages to the 'good for' attributes of the business in the review that is being predicted.
user_good_for <- select(user_attributes,-value) %>% filter(grepl("Good For",attribute))
business_good_for <- select(business_attributes,-value) %>% filter(grepl("Good For",attribute))

good_for_reviews  <- reviews %>%
  select(.,review_id,user_id,business_id,stars) %>%
  filter(user_id %in% set_users) %>%
  inner_join(business_good_for,.,by="business_id") %>%
  left_join(.,user_good_for, by=c("user_id","attribute")) %>%
  mutate(user_good_for_stars = ifelse(is.na(user_attribute_stars),0,user_attribute_stars),
         user_good_for_n = ifelse(is.na(user_at_n),0,user_at_n)) %>%
  group_by(review_id) %>%
  summarise(user_good_for_stars=mean(user_good_for_stars),user_good_for_n=sum(user_good_for_n))
User Ratings by business attribute (price range)
#The user price range effect takes the average ratings for each price range attribute from the user's home city and connects these averages to the price range attributes of the business in the review that is being predicted.
user_price <- user_attributes %>%
  rename(price_range=value) %>%
  filter(grepl("Price Range",attribute)) %>%
  select(.,-attribute) 

business_price_ranges <- business_attributes %>% 
  filter(grepl("Price Range",attribute)) %>% 
  select(business_id,value) %>% 
  rename(price_range=value)

price_range_reviews <- reviews %>%
  select(.,review_id,user_id,business_id,stars) %>%
  filter(user_id %in% set_users) %>%
  left_join(.,business_price_ranges) %>%
  left_join(.,user_price, by=c("user_id","price_range")) %>%
  mutate(user_price_stars = ifelse(is.na(user_attribute_stars),0,user_attribute_stars),
         user_price_n = ifelse(is.na(user_at_n),0,user_at_n)) %>%
  group_by(review_id) %>%
  summarise(user_price_stars=mean(user_price_stars), user_price_n=mean(user_price_n))
Join all the variables created so far to the review dataset
#First join eveything except the attribute effects
category_reviews <- reviews %>%
  select(.,review_id,user_id,business_id,stars) %>%
  filter(user_id %in% set_users) %>%
  left_join(.,select(business2,business_id,tourist_stars,t_n,business_stars,b_n,yelp_city)) %>%
  left_join(.,users_trstvsloc,by=c("user_id","yelp_city")) %>%  filter(pred_tourist=="tourist") %>%
  left_join(.,select(user_effects,user_id,user_local_stars,user_local_n)) %>%
  inner_join(business_categories,.,by="business_id") %>%
  filter(!(category %in% c("Restaurants","Cafe","Food"))) %>%
  left_join(.,user_categories,by=c("user_id","category")) %>%
  mutate(user_category_stars = ifelse(is.na(user_category_stars),0,user_category_stars),
         user_cat_n = ifelse(is.na(user_cat_n),0,user_cat_n)) %>%
  group_by(review_id,business_id) %>%
  summarise(stars=mean(stars),
            user_local_stars=mean(user_local_stars),
            user_local_n=mean(user_local_n),
            tourist_stars=mean(tourist_stars),
            t_n=mean(t_n),
            business_stars=mean(business_stars),
            b_n=mean(b_n),
            user_category_stars=mean(user_category_stars),
            user_cat_n=sum(user_cat_n)) 
write.csv(category_reviews, file = "category_reviews.csv")

category_reviews <- read.csv("category_reviews.csv")[-1]

#Now add attributes
review_set <- unique(category_reviews$review_id)
dat <- left_join(category_reviews,ambience_reviews %>% filter(review_id %in% review_set))
dat <- left_join(dat,good_for_reviews %>% filter(review_id %in% review_set))
dat <- left_join(dat,price_range_reviews %>% filter(review_id %in% review_set))

dat <- left_join(dat,select(business_full,business_id,yelp_city))
dat <- left_join(dat,select(reviews,review_id,user_id))
dat <- left_join(dat,select(yc_locals_trst,user_id,yelp_city), by="user_id")
dat <- rename(dat, business_city = yelp_city.x)
dat <- rename(dat, user_city = yelp_city.y)


dat <- dat %>%
  #back off to a coarser average when a finer-grained effect is missing (coded as 0)
  mutate(business_stars = ifelse(business_stars == 0, mu, business_stars)) %>%
  mutate(user_local_stars = ifelse(user_local_stars == 0, mu, user_local_stars)) %>%
  mutate(tourist_stars = ifelse(tourist_stars == 0, business_stars, tourist_stars)) %>%
  mutate(user_category_stars = ifelse(user_category_stars == 0, user_local_stars, user_category_stars)) %>%
  mutate(user_ambience_stars = ifelse(user_ambience_stars == 0, user_local_stars, user_ambience_stars)) %>%
  mutate(user_good_for_stars = ifelse(user_good_for_stars == 0, user_local_stars, user_good_for_stars)) %>%
  mutate(user_price_stars = ifelse(user_price_stars == 0, user_local_stars, user_price_stars))

write_csv(dat, "final_dataset.csv")
dat <- read_csv("final_dataset.csv")

Models for Yelp Recommendation Predictions

set.seed(1)
inTrain <- createDataPartition(y = dat$stars, p=0.5)
train_reviews <- slice(dat, inTrain$Resample1)
test_reviews <- slice(dat, -inTrain$Resample1)
Multinomial with LASSO penalty
#We model the data using multinomial logistic regression to predict the star category. We use the lasso with cross-validation to choose the penalty.
x <- as.matrix(train_reviews[,4:17])
multinomial_lasso <- glmnet(x,y=as.factor(train_reviews$stars),alpha=1,family='multinomial')
plot(multinomial_lasso,xvar="lambda", 
     main = "Coefficients at different Log Lambda", cex.main = 1)

#CV to pick penalty
cv.multinomial.lasso <- cv.glmnet(x,y=as.factor(train_reviews$stars),alpha=1,family='multinomial')
plot(cv.multinomial.lasso)
title(main="Multinomial Deviance vs Log Lambda", cex.main = 1, line = 3)

best_multinomial_lambda <- cv.multinomial.lasso$lambda.min
#run with best lambda
lasso_multinomial_best <-glmnet(x,y=as.factor(train_reviews$stars),alpha=1,family='multinomial',
                                lambda=best_multinomial_lambda)
#betas
lasso_multinomial_best$beta
## $`1`
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                               s0
## user_local_stars    -0.419521561
## user_local_n        -0.002189755
## tourist_stars       -0.001861042
## t_n                  .          
## business_stars      -0.715140754
## b_n                  .          
## user_category_stars  .          
## user_cat_n           .          
## user_ambience_stars  .          
## user_ambience_n      .          
## user_good_for_stars  .          
## user_good_for_n      .          
## user_price_stars     .          
## user_price_n         .          
## 
## $`2`
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## user_local_stars    -0.31196230
## user_local_n         .         
## tourist_stars        .         
## t_n                  .         
## business_stars      -0.40824287
## b_n                  .         
## user_category_stars -0.35694578
## user_cat_n           .         
## user_ambience_stars  .         
## user_ambience_n      .         
## user_good_for_stars  .         
## user_good_for_n      .         
## user_price_stars    -0.02990086
## user_price_n         .         
## 
## $`3`
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## user_local_stars     .           
## user_local_n         1.280848e-03
## tourist_stars        .           
## t_n                  .           
## business_stars       .           
## b_n                 -6.326671e-06
## user_category_stars  .           
## user_cat_n           1.511592e-07
## user_ambience_stars  .           
## user_ambience_n      1.997905e-04
## user_good_for_stars  6.746205e-03
## user_good_for_n      .           
## user_price_stars     .           
## user_price_n         1.668073e-03
## 
## $`4`
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## user_local_stars    .          
## user_local_n        0.001101784
## tourist_stars       0.284612578
## t_n                 .          
## business_stars      0.675353500
## b_n                 .          
## user_category_stars .          
## user_cat_n          .          
## user_ambience_stars 0.011842926
## user_ambience_n     .          
## user_good_for_stars 0.004358944
## user_good_for_n     .          
## user_price_stars    0.016205921
## user_price_n        .          
## 
## $`5`
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## user_local_stars     8.733693e-01
## user_local_n         .           
## tourist_stars        .           
## t_n                  .           
## business_stars       1.645639e+00
## b_n                  5.399750e-05
## user_category_stars  .           
## user_cat_n          -5.237706e-07
## user_ambience_stars  .           
## user_ambience_n      .           
## user_good_for_stars  .           
## user_good_for_n      1.623056e-04
## user_price_stars     .           
## user_price_n        -4.129805e-03
#predictions
train_pred <- predict(lasso_multinomial_best, newx = as.matrix(train_reviews[,4:17]), type = "class")
test_pred <- predict(lasso_multinomial_best, newx = as.matrix(test_reviews[,4:17]), type = "class")
#Confusion Matrix on Train Set
mean(train_pred == train_reviews$stars)
## [1] 0.4101287
tab_train <- table(train_pred, train_reviews$stars)
conf_train <- confusionMatrix(tab_train)
conf_train
## Confusion Matrix and Statistics
## 
##           
## train_pred   1   2   3   4   5
##          1   3   0   0   0   0
##          2  46  41  24  18  14
##          3  18  42  48  39   9
##          4 167 330 576 951 580
##          5  42 103 210 624 933
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4101          
##                  95% CI : (0.3962, 0.4242)
##     No Information Rate : 0.3387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1344          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.0108696  0.07946 0.055944   0.5827   0.6074
## Specificity          1.0000000  0.97629 0.972727   0.4812   0.7017
## Pos Pred Value       1.0000000  0.28671 0.307692   0.3652   0.4880
## Neg Pred Value       0.9433022  0.89840 0.826255   0.6924   0.7925
## Prevalence           0.0572852  0.10710 0.178082   0.3387   0.3188
## Detection Rate       0.0006227  0.00851 0.009963   0.1974   0.1936
## Detection Prevalence 0.0006227  0.02968 0.032379   0.5405   0.3968
## Balanced Accuracy    0.5054348  0.52787 0.514336   0.5319   0.6546
#Confusion Matrix on Test Set
mean(test_pred == test_reviews$stars)
## [1] 0.407638
tab_test <- table(test_pred, test_reviews$stars)
conf_test <- confusionMatrix(tab_test)
conf_test
## Confusion Matrix and Statistics
## 
##          
## test_pred   1   2   3   4   5
##         1   4   0   1   0   0
##         2  50  38  28  15   8
##         3  26  42  56  29   8
##         4 180 281 594 943 597
##         5  59  95 196 645 923
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4076          
##                  95% CI : (0.3937, 0.4217)
##     No Information Rate : 0.3387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1312          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.0125392 0.083333  0.06400   0.5778   0.6009
## Specificity          0.9997777 0.976845  0.97337   0.4815   0.6968
## Pos Pred Value       0.8000000 0.273381  0.34783   0.3634   0.4812
## Neg Pred Value       0.9345523 0.910665  0.82414   0.6901   0.7886
## Prevalence           0.0662100 0.094645  0.18161   0.3387   0.3188
## Detection Rate       0.0008302 0.007887  0.01162   0.1957   0.1916
## Detection Prevalence 0.0010378 0.028850  0.03342   0.5386   0.3981
## Balanced Accuracy    0.5061585 0.530089  0.51869   0.5297   0.6489
Linear Regression with LASSO penalty
linear_lasso <- glmnet(x,y=train_reviews$stars,alpha=1)
plot(linear_lasso,xvar="lambda")
title(main = "Coefficients at different Log Lambda", cex.main = 1, line = 3)

#CV to pick lambda
cv.linear.lasso <- cv.glmnet(x,y=train_reviews$stars,alpha=1)
plot(cv.linear.lasso)
title(main="MSE vs Log Lambda", cex.main = 1, line = 3)

best_lambda <- cv.linear.lasso$lambda.min
#run with best lambda
lasso_linear_best <-glmnet(x,y=train_reviews$stars,alpha=1, lambda=best_lambda)
#betas
lasso_linear_best$beta
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## user_local_stars     4.760840e-01
## user_local_n         .           
## tourist_stars        .           
## t_n                  .           
## business_stars       8.104072e-01
## b_n                  1.705149e-05
## user_category_stars  1.207575e-01
## user_cat_n          -2.466039e-07
## user_ambience_stars  .           
## user_ambience_n      .           
## user_good_for_stars  .           
## user_good_for_n      2.327264e-04
## user_price_stars     1.203507e-03
## user_price_n        -1.095969e-03
#predictions
train_pred <- predict(lasso_linear_best, newx = as.matrix(train_reviews[,4:17]), type = "response")
test_pred <- predict(lasso_linear_best, newx = as.matrix(test_reviews[,4:17]), type = "response")
#Confusion Matrix on Train Set
mean(round(train_pred) == train_reviews$stars)
## [1] 0.3497302
tab_train <- table(round(train_pred), train_reviews$stars)
conf_train <- confusionMatrix(tab_train)
conf_train
## Confusion Matrix and Statistics
## 
##    
##        1    2    3    4    5
##   1    2    0    0    0    0
##   2   26   30   19    6    5
##   3  139  248  307  373  154
##   4  105  231  521 1222 1253
##   5    4    7   11   31  124
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3497          
##                  95% CI : (0.3363, 0.3634)
##     No Information Rate : 0.3387          
##     P-Value [Acc > NIR] : 0.05528         
##                                           
##                   Kappa : 0.0802          
##  Mcnemar's Test P-Value : < 2e-16         
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.0072464 0.058140  0.35781   0.7488  0.08073
## Specificity          1.0000000 0.986983  0.76919   0.3377  0.98385
## Pos Pred Value       1.0000000 0.348837  0.25143   0.3667  0.70056
## Neg Pred Value       0.9431063 0.897295  0.84682   0.7241  0.69576
## Prevalence           0.0572852 0.107098  0.17808   0.3387  0.31880
## Detection Rate       0.0004151 0.006227  0.06372   0.2536  0.02574
## Detection Prevalence 0.0004151 0.017850  0.25342   0.6916  0.03674
## Balanced Accuracy    0.5036232 0.522561  0.56350   0.5433  0.53229
#Confusion Matrix on Test Set
mean(round(test_pred) == test_reviews$stars)
## [1] 0.3659195
tab_test <- table(round(test_pred), test_reviews$stars)
conf_test <- confusionMatrix(tab_test)
conf_test
## Confusion Matrix and Statistics
## 
##    
##        1    2    3    4    5
##   1    1    0    0    0    0
##   2   34   29   17    6    5
##   3  157  199  353  356  150
##   4  126  225  497 1228 1229
##   5    1    3    8   42  152
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3659          
##                  95% CI : (0.3523, 0.3797)
##     No Information Rate : 0.3387          
##     P-Value [Acc > NIR] : 3.913e-05       
##                                           
##                   Kappa : 0.1024          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.0031348 0.063596  0.40343   0.7525  0.09896
## Specificity          1.0000000 0.985786  0.78138   0.3481  0.98355
## Pos Pred Value       1.0000000 0.318681  0.29053   0.3716  0.73786
## Neg Pred Value       0.9339838 0.909668  0.85512   0.7330  0.69991
## Prevalence           0.0662100 0.094645  0.18161   0.3387  0.31880
## Detection Rate       0.0002076 0.006019  0.07327   0.2549  0.03155
## Detection Prevalence 0.0002076 0.018888  0.25218   0.6860  0.04276
## Balanced Accuracy    0.5015674 0.524691  0.59241   0.5503  0.54125
KNN
#Sam's KNN function
split_data <- function(data,prop) {
  n <- nrow(data)
  rands <- rnorm(n)
  training <- rands < quantile(rands,prop)
  return(list("training" = data[training,], "test" = data[training==FALSE,]))
}

train_sam <- function(it,prop,test,training,inputs,outcome,k) {
  accuracy <- NULL
  for(i in 1:it) {
    cross_val_train = split_data(training,prop)
    predictions <- knn(cross_val_train$training[,inputs],cross_val_train$test[,inputs],cross_val_train$training[[outcome]],k)
    #score against the held-out labels from this split, not the outer test set
    accuracy <- c(accuracy,mean(as.numeric(predictions)==as.numeric(cross_val_train$test[[outcome]])))
  }
  return(accuracy)
}

knn_master <- function(it,prop,test,training,inputs,outcome,min_k,max_k){
  ks <- seq(min_k,max_k,50)
  accuracy <- rep(0,length(ks))
  for(i in 1:length(ks)) {
    accuracy_k <- train_sam(it,prop,test,training,inputs,outcome,ks[i])
    accuracy[i] <- mean(accuracy_k)
  }
  return(cbind(ks,accuracy))
}
#CV to check best k
knn_test <- knn_master(10,.1,test_reviews,train_reviews,names(train_reviews)[4:6],"stars",1,251)

data.frame(knn_test) %>% ggplot(aes(x=ks, y= accuracy)) + geom_line() + labs(x="# K in KNN",y="Cross-validated accuracy on train")

knn_prediction <- knn(train_reviews[,names(train_reviews)[4:6]],test_reviews[,names(test_reviews)[4:6]],train_reviews[["stars"]],50)

#Accuracy
mean(round(as.numeric(knn_prediction))==as.numeric(test_reviews$stars))
## [1] 0.3675799
Random Forest
#remove character variables
train_reviews_rf <- train_reviews[,3:17]
#parallel processing
registerDoMC(cores=3)
#CV to pick best mtry
ctrl = trainControl(method="cv", number=10)
trf = train(factor(stars)~., 
            data=train_reviews_rf, 
            method="rf", 
            metric="Accuracy",
            trControl=ctrl,
            allowParallel=TRUE)
print(trf)
## Random Forest 
## 
## 4818 samples
##   14 predictor
##    5 classes: '1', '2', '3', '4', '5' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4336, 4338, 4336, 4335, 4336, 4336, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    2    0.3985074  0.1367105  0.01321344   0.01879675
##    8    0.3883453  0.1304467  0.01740376   0.02445771
##   14    0.3939414  0.1392006  0.01484096   0.02279629
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
#fit with mtry = 2
rf_fit <- randomForest(factor(stars)~., data=train_reviews_rf, ntree = 100, mtry = 2)
#predictions
f_hat_rf <- predict(rf_fit, newdata = train_reviews, type="response")
f_hat_rf2 <- predict(rf_fit, newdata = test_reviews, type="response")

#Confusion Matrix on Train Set
mean(f_hat_rf == train_reviews$stars)
## [1] 0.9956413
tab_rftrain <- table(f_hat_rf, train_reviews$stars)
conf_rftrain <- confusionMatrix(tab_rftrain)
conf_rftrain
## Confusion Matrix and Statistics
## 
##         
## f_hat_rf    1    2    3    4    5
##        1  275    0    0    0    0
##        2    0  515    0    0    2
##        3    0    0  853    2    1
##        4    1    1    3 1628    7
##        5    0    0    2    2 1526
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9956          
##                  95% CI : (0.9933, 0.9973)
##     No Information Rate : 0.3387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9941          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99638   0.9981   0.9942   0.9975   0.9935
## Specificity           1.00000   0.9995   0.9992   0.9962   0.9988
## Pos Pred Value        1.00000   0.9961   0.9965   0.9927   0.9974
## Neg Pred Value        0.99978   0.9998   0.9987   0.9987   0.9970
## Prevalence            0.05729   0.1071   0.1781   0.3387   0.3188
## Detection Rate        0.05708   0.1069   0.1770   0.3379   0.3167
## Detection Prevalence  0.05708   0.1073   0.1777   0.3404   0.3176
## Balanced Accuracy     0.99819   0.9988   0.9967   0.9969   0.9961
#Confusion Matrix on Test Set
mean(f_hat_rf2 == test_reviews$stars)
## [1] 0.4134496
tab_rftest <- table(f_hat_rf2, test_reviews$stars)
conf_rftest <- confusionMatrix(tab_rftest)
conf_rftest
## Confusion Matrix and Statistics
## 
##          
## f_hat_rf2   1   2   3   4   5
##         1  17  16  13   8   4
##         2  44  45  55  52  19
##         3  53  65 155 142  62
##         4 126 221 450 884 560
##         5  79 109 202 546 891
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4134          
##                  95% CI : (0.3995, 0.4275)
##     No Information Rate : 0.3387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1603          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.053292  0.09868  0.17714   0.5417   0.5801
## Specificity          0.990887  0.96103  0.91834   0.5741   0.7148
## Pos Pred Value       0.293103  0.20930  0.32495   0.3945   0.4877
## Neg Pred Value       0.936555  0.91071  0.83414   0.7097   0.7844
## Prevalence           0.066210  0.09465  0.18161   0.3387   0.3188
## Detection Rate       0.003528  0.00934  0.03217   0.1835   0.1849
## Detection Prevalence 0.012038  0.04462  0.09900   0.4651   0.3792
## Balanced Accuracy    0.522089  0.52986  0.54774   0.5579   0.6474

We get similar accuracies with the random forest and the multinomial logistic regression. We pick the random forest due to its slightly higher test accuracy.

The accuracy is not high for any of our models. As a compromise, we simplified the response to a binary indicator of whether we would recommend the business. We recommend a business only if the review gives it 5 stars, and we fit new models to this binary response.

Binomial Recommendation Prediction with LASSO penalty
#create new train set that has a column for whether or not the stars is 5
#remove stars
train_reviews_r <- train_reviews_rf %>% mutate(recommend = ifelse(stars>4, 1,0))
train_reviews_r <- train_reviews_r[-1]
train_reviews <- train_reviews %>% mutate(recommend = ifelse(stars>4, 1,0))
test_reviews <- test_reviews %>% mutate(recommend = ifelse(stars>4, 1,0))

#We model the data using binomial logistic regression to predict whether the rating is 5 or not. We use the lasso with cross-validation to choose the penalty.
x <- as.matrix(train_reviews_r[,1:14])
binomial_lasso <- glmnet(x,y=as.factor(train_reviews_r$recommend),alpha=1,family='binomial')
plot(binomial_lasso,xvar="lambda", 
     main = "Coefficients at different Log Lambda", cex.main = 1)

#CV to pick penalty (family = "binomial", so the CV criterion is binomial deviance)
cv.binomial_lasso <- cv.glmnet(x,y=train_reviews_r$recommend,alpha=1,family='binomial')
plot(cv.binomial_lasso)
title(main="Binomial Deviance vs Log Lambda", cex.main = 1, line = 3)

best_binomial_lambda <- cv.binomial_lasso$lambda.min
#run with best lambda
lasso_binomial_best <-glmnet(x,y=as.factor(train_reviews_r$recommend),alpha=1,family='binomial',
                                lambda=best_binomial_lambda)
#betas
lasso_binomial_best$beta
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## user_local_stars     1.123882e+00
## user_local_n        -3.637280e-03
## tourist_stars       -2.541404e-01
## t_n                 -4.783393e-04
## business_stars       1.420599e+00
## b_n                  4.336053e-04
## user_category_stars  8.801662e-02
## user_cat_n          -1.859688e-06
## user_ambience_stars -1.107927e-02
## user_ambience_n      3.878890e-04
## user_good_for_stars -2.983266e-02
## user_good_for_n      3.518228e-03
## user_price_stars    -1.050560e-02
## user_price_n        -7.835870e-03
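The lasso coefficients above are on the log-odds scale; exponentiating gives odds ratios, which are easier to interpret. For example, using the `business_stars` coefficient from the table above:

```r
# Logistic coefficients are log-odds; exp() converts them to odds ratios.
# A one-star increase in business_stars multiplies the odds of a
# 5-star review by roughly exp(1.4206), about 4.14, holding the
# other predictors fixed.
exp(1.420599)
```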
#predictions
train_pred <- predict(lasso_binomial_best, newx = as.matrix(train_reviews[,4:17]), type = "class")
test_pred <- predict(lasso_binomial_best, newx = as.matrix(test_reviews[,4:17]), type = "class")
#Confusion Matrix on Train Set
mean(train_pred == train_reviews$recommend)
## [1] 0.7152345
tab_train <- table(train_pred, train_reviews$recommend)
conf_train <- confusionMatrix(tab_train)
conf_train
## Confusion Matrix and Statistics
## 
##           
## train_pred    0    1
##          0 2971 1061
##          1  311  475
##                                           
##                Accuracy : 0.7152          
##                  95% CI : (0.7023, 0.7279)
##     No Information Rate : 0.6812          
##     P-Value [Acc > NIR] : 1.664e-07       
##                                           
##                   Kappa : 0.2465          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9052          
##             Specificity : 0.3092          
##          Pos Pred Value : 0.7369          
##          Neg Pred Value : 0.6043          
##              Prevalence : 0.6812          
##          Detection Rate : 0.6166          
##    Detection Prevalence : 0.8369          
##       Balanced Accuracy : 0.6072          
##                                           
##        'Positive' Class : 0               
## 
#Confusion Matrix on Test Set
mean(test_pred == test_reviews$recommend)
## [1] 0.7160648
tab_test <- table(test_pred, test_reviews$recommend)
conf_test <- confusionMatrix(tab_test)
conf_test
## Confusion Matrix and Statistics
## 
##          
## test_pred    0    1
##         0 2992 1078
##         1  290  458
##                                           
##                Accuracy : 0.7161          
##                  95% CI : (0.7031, 0.7288)
##     No Information Rate : 0.6812          
##     P-Value [Acc > NIR] : 8.468e-08       
##                                           
##                   Kappa : 0.243           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9116          
##             Specificity : 0.2982          
##          Pos Pred Value : 0.7351          
##          Neg Pred Value : 0.6123          
##              Prevalence : 0.6812          
##          Detection Rate : 0.6210          
##    Detection Prevalence : 0.8447          
##       Balanced Accuracy : 0.6049          
##                                           
##        'Positive' Class : 0               
## 
Linear Regression for Recommendation with LASSO penalty
linear_lasso <- glmnet(x,y=train_reviews_r$recommend,alpha=1)
plot(linear_lasso,xvar="lambda")
title(main = "Coefficients at different Log Lambda", cex.main = 1, line = 3)

#CV to pick lambda
cv.linear.lasso <- cv.glmnet(x,y=train_reviews_r$recommend,alpha=1)
plot(cv.linear.lasso)
title(main="MSE vs Log Lambda", cex.main = 1, line = 3)

best_lambda <- cv.linear.lasso$lambda.min
#run with best lambda
lasso_linear_best <-glmnet(x,y=train_reviews_r$recommend,alpha=1, lambda=best_lambda)
#betas
lasso_linear_best$beta
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## user_local_stars     1.865496e-01
## user_local_n        -5.178513e-04
## tourist_stars       -1.866025e-02
## t_n                 -1.298065e-04
## business_stars       2.292111e-01
## b_n                  1.119701e-04
## user_category_stars  2.997415e-02
## user_cat_n          -5.303926e-07
## user_ambience_stars -2.310836e-03
## user_ambience_n      2.724408e-05
## user_good_for_stars -7.314148e-03
## user_good_for_n      4.838905e-04
## user_price_stars    -4.240855e-03
## user_price_n        -1.047985e-03
#predictions
train_pred <- predict(lasso_linear_best, newx = as.matrix(train_reviews[,4:17]), type = "response")
test_pred <- predict(lasso_linear_best, newx = as.matrix(test_reviews[,4:17]), type = "response")
#Confusion Matrix on Train Set
mean(round(train_pred) == train_reviews$recommend)
## [1] 0.7114985
tab_train <- table(round(train_pred), train_reviews$recommend)
conf_train <- confusionMatrix(tab_train)
conf_train
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 3068 1176
##   1  214  360
##                                           
##                Accuracy : 0.7115          
##                  95% CI : (0.6985, 0.7243)
##     No Information Rate : 0.6812          
##     P-Value [Acc > NIR] : 2.861e-06       
##                                           
##                   Kappa : 0.203           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9348          
##             Specificity : 0.2344          
##          Pos Pred Value : 0.7229          
##          Neg Pred Value : 0.6272          
##              Prevalence : 0.6812          
##          Detection Rate : 0.6368          
##    Detection Prevalence : 0.8809          
##       Balanced Accuracy : 0.5846          
##                                           
##        'Positive' Class : 0               
## 
#Confusion Matrix on Test Set
mean(round(test_pred) == test_reviews$recommend)
## [1] 0.7156496
tab_test <- table(round(test_pred), test_reviews$recommend)
conf_test <- confusionMatrix(tab_test)
conf_test
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 3091 1179
##   1  191  357
##                                           
##                Accuracy : 0.7156          
##                  95% CI : (0.7027, 0.7284)
##     No Information Rate : 0.6812          
##     P-Value [Acc > NIR] : 1.189e-07       
##                                           
##                   Kappa : 0.2102          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9418          
##             Specificity : 0.2324          
##          Pos Pred Value : 0.7239          
##          Neg Pred Value : 0.6515          
##              Prevalence : 0.6812          
##          Detection Rate : 0.6416          
##    Detection Prevalence : 0.8863          
##       Balanced Accuracy : 0.5871          
##                                           
##        'Positive' Class : 0               
## 
Binomial Recommendation Prediction with KNN
#Sam's KNN function
split_data <- function(data,prop) {
  n <- nrow(data)
  rands <- rnorm(n)
  training <- rands < quantile(rands,prop)
  return(list("training" = data[training,], "test" = data[training==FALSE,]))
}

train_sam <- function(it,prop,test,training,inputs,outcome,k) {
  accuracy <- NULL
  for(i in 1:it) {
    cross_val_train = split_data(training,prop)
    predictions <- knn(cross_val_train$training[,inputs],cross_val_train$test[,inputs],cross_val_train$training[[outcome]],k)
    #score against the held-out fold; convert the factor via as.character rather than level indices
    accuracy <- c(accuracy,mean(as.character(predictions)==as.character(cross_val_train$test[[outcome]])))
  }
  return(accuracy)
}

knn_master <- function(it,prop,test,training,inputs,outcome,min_k,max_k){
  ks <- seq(min_k,max_k,50)
  accuracy <- rep(0,length(ks))
  for(i in 1:length(ks)) {
    accuracy_k <- train_sam(it,prop,test,training,inputs,outcome,ks[i])
    accuracy[i] <- mean(accuracy_k)
  }
  return(cbind(ks,accuracy))
}
#CV to check best k
knn_test <- knn_master(10,.1,test_reviews,train_reviews_r,names(train_reviews_r)[c(1,3,5,7,9)],"recommend",1,251)

data.frame(knn_test) %>% ggplot(aes(x=ks, y= accuracy)) + geom_line() + labs(x="# K in KNN",y="Cross-validated accuracy on train")

knn_prediction <- knn(train_reviews[,names(train_reviews)[c(4,6,8,10,12)]],test_reviews[,names(test_reviews)[c(4,6,8,10,12)]],train_reviews[["recommend"]],200)

#Accuracy: knn() returns a factor, so convert via as.character before comparing
mean(as.numeric(as.character(knn_prediction)) == test_reviews$recommend)
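A subtlety when scoring `knn()` output: it returns a factor, and `as.numeric()` on a factor yields the level indices (1, 2) rather than the labels (0, 1), so a direct numeric comparison against a 0/1 response silently computes the wrong quantity. A minimal illustration:

```r
# Illustration: as.numeric() on a factor returns level indices, not labels
p <- factor(c("0", "1", "1"))
as.numeric(p)                # 1 2 2 -- level indices
as.numeric(as.character(p))  # 0 1 1 -- the intended 0/1 values
```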
Binomial Recommendation Prediction with Random Forest
#CV to pick mtry
ctrl2 = trainControl(method="cv", number=10)
trf2 = train(factor(recommend)~., 
            data=train_reviews_r, 
            method="rf", 
            metric="Accuracy",
            trControl=ctrl2,
            allowParallel=TRUE)
print(trf2)
## Random Forest 
## 
## 4818 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4337, 4336, 4335, 4336, 4337, 4337, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    2    0.7067147  0.2314448  0.01493641   0.03822154
##    8    0.7036009  0.2452126  0.01700497   0.04161417
##   14    0.7042281  0.2534940  0.01563823   0.03826444
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
#run Random Forest (mtry = 8; note that the CV above selected mtry = 2 on accuracy)
rf_fit2 <- randomForest(factor(recommend)~., data=train_reviews_r, ntree = 100, mtry = 8)
#predict on train and test set
f_hat_rf <- predict(rf_fit2, newdata = train_reviews, type="response")
f_hat_rf2 <- predict(rf_fit2, newdata = test_reviews, type="response")
#Confusion Matrix on Train Set
mean(f_hat_rf == train_reviews$recommend)
## [1] 0.9973018
tab_rftrain <- table(f_hat_rf, train_reviews$recommend)
conf_rftrain <- confusionMatrix(tab_rftrain)
conf_rftrain
## Confusion Matrix and Statistics
## 
##         
## f_hat_rf    0    1
##        0 3270    1
##        1   12 1535
##                                           
##                Accuracy : 0.9973          
##                  95% CI : (0.9954, 0.9986)
##     No Information Rate : 0.6812          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9938          
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.9963          
##             Specificity : 0.9993          
##          Pos Pred Value : 0.9997          
##          Neg Pred Value : 0.9922          
##              Prevalence : 0.6812          
##          Detection Rate : 0.6787          
##    Detection Prevalence : 0.6789          
##       Balanced Accuracy : 0.9978          
##                                           
##        'Positive' Class : 0               
## 
#Confusion Matrix on Test Set
mean(f_hat_rf2 == test_reviews$recommend)
## [1] 0.7054795
tab_rftest <- table(f_hat_rf2, test_reviews$recommend)
conf_rftest <- confusionMatrix(tab_rftest)
conf_rftest
## Confusion Matrix and Statistics
## 
##          
## f_hat_rf2    0    1
##         0 2825  962
##         1  457  574
##                                           
##                Accuracy : 0.7055          
##                  95% CI : (0.6924, 0.7183)
##     No Information Rate : 0.6812          
##     P-Value [Acc > NIR] : 0.0001444       
##                                           
##                   Kappa : 0.2569          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8608          
##             Specificity : 0.3737          
##          Pos Pred Value : 0.7460          
##          Neg Pred Value : 0.5567          
##              Prevalence : 0.6812          
##          Detection Rate : 0.5863          
##    Detection Prevalence : 0.7860          
##       Balanced Accuracy : 0.6172          
##                                           
##        'Positive' Class : 0               
## 
Variable Importance by Random Forest
variable_importance <- importance(rf_fit2) 
tmp <- data_frame(feature = rownames(variable_importance),
                  Gini = variable_importance[,1]) %>%
  arrange(desc(Gini))
kable(tmp[1:10,])
feature                   Gini
business_stars        353.2224
user_good_for_stars   192.9086
user_local_stars      158.5924
tourist_stars         143.6957
user_cat_n            139.4086
b_n                   132.2270
user_ambience_stars   130.1691
user_category_stars   129.9506
user_good_for_n       128.9676
user_local_n          128.0501

This shows that it is much easier to identify 5-star reviews, which we use to flag highly recommended restaurants, than to predict the exact star rating. All of the models except KNN predicted the recommendation with moderate accuracy.
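Because the classes are imbalanced (the prevalence of the positive class 0 is about 0.68), plain accuracy is flattered by the majority class; balanced accuracy averages sensitivity and specificity instead. Recomputing it by hand from the random forest test confusion matrix above:

```r
# Random forest test confusion matrix (rows = predicted, cols = actual), from above
cm <- matrix(c(2825, 457, 962, 574), nrow = 2,
             dimnames = list(pred = c("0", "1"), actual = c("0", "1")))
sens <- cm["0", "0"] / sum(cm[, "0"])          # 2825/3282
spec <- cm["1", "1"] / sum(cm[, "1"])          # 574/1536
round(c(sensitivity = sens, specificity = spec,
        balanced = (sens + spec) / 2), 4)      # 0.8608 0.3737 0.6172
```

The 0.6172 balanced accuracy matches the `confusionMatrix` output and is a fairer summary than the raw 0.7055 accuracy.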

Visualizations

Create latitude and longitude attributes for the yelp cities
city_longlat <- matrix(c("Pittsburgh", 40.4406, -79.9959, 
                        "Charlotte", 35.2271, -80.8431,
                        "Madison", 43.0731, -89.4012,
                        "Urbana-Champaign", 40.1106, -88.2073,
                        "Phoenix", 33.4484, -112.0740,
                        "Las Vegas", 36.1699, -115.1398,
                        "Montreal", 45.5017, -73.5673,
                        "Waterloo", 43.4643, -80.5204,
                        "Edinburgh", 55.9533, -3.1883,
                        "Karlsruhe", 49.0069, 8.4037), 
                      ncol = 3, byrow = TRUE)
city_longlat <- data.frame(city_longlat, stringsAsFactors = FALSE)
colnames(city_longlat) <- c("yelp_city", "yc_latitude", "yc_longitude")
#convert coordinates to numeric
city_longlat <- city_longlat %>% mutate(yc_longitude = as.numeric(yc_longitude),
                                        yc_latitude = as.numeric(yc_latitude))
Creating dataset for Yelpers traveling within US
#join users that are local and tourists in at least one city each to the reviews and business datasets
locals_map <- yc_locals_trst %>%
  left_join(reviews) %>%
  inner_join(business_full, by = "business_id")
## Joining by: "user_id"
colnames(locals_map)[2] <- "user_city"
colnames(locals_map)[6] <- "review_stars"
colnames(locals_map)[19] <- "business_stars"
colnames(locals_map)[23] <- "business_city"
#remove business specific geographical information
locals_map1 <- locals_map %>% select(user_id, user_city, business_id, state, business_city, date)
#keep only reviews written outside the user's home city (travel reviews)
locals_map_travel <- filter(locals_map1, user_city != business_city)
locals_map_travel <- locals_map_travel %>% left_join(city_longlat, by = c("user_city" = "yelp_city"))
locals_map_travel <- locals_map_travel[, c(1,2, 8, 7, 3:6)]
colnames(locals_map_travel)[3:4] <- c("uc_lon", "uc_lat")
locals_map_travel <- locals_map_travel %>% left_join(city_longlat, by = c("business_city" = "yelp_city"))
locals_map_travel <- locals_map_travel[, c(1:7, 10, 9, 8)]
colnames(locals_map_travel)[8:9] <- c("bc_lon", "bc_lat")
#gather information on how many yelp users traveled to what cities
locals_map_travel2 <- locals_map_travel %>%
  group_by(user_city, business_city) %>%
  mutate(num_travel = n()) %>%
  select(user_city, uc_lon, uc_lat, business_city, bc_lon, bc_lat, state, num_travel) %>% unique
#color code US cities
us_city <- c("Pittsburgh","Charlotte","Madison","Urbana-Champaign","Phoenix","Las Vegas")
#population data in 100,000s by MSA in 2010/2014
us_city_pop <- c(23, 22, 8, 2, 20, 42)
us_city_color <- c("midnightblue", "purple4", "blue", "seagreen", "red", "brown")
us_city_color <- data.frame(cbind(us_city, us_city_pop, us_city_color))
us_city_color <- us_city_color %>% left_join(city_longlat, by = c("us_city"="yelp_city")) %>%
  mutate(us_city_pop = as.numeric(as.character(us_city_pop)))
## Warning in left_join_impl(x, y, by$x, by$y): joining factors with different
## levels, coercing to character vector
#join with travel data, get number of people traveling from a city divided by the city's population
us_travel <- locals_map_travel2 %>% 
  filter((user_city %in% us_city) & (business_city %in% us_city)) %>% 
  left_join(us_city_color, by=c("user_city" = "us_city")) %>%
  mutate(pop_num = num_travel/us_city_pop) 
## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
Mapping Yelp Users Travel within US

Here we examine the number of yelpers in our dataset traveling within the US. Each map shows people traveling into a city from the other five US cities in the data. The color intensity of the line connecting two cities reflects the number of users traveling, scaled by the origin city's population (lighter lines mean fewer people).

dev.off()
## null device 
##           1
write.csv(us_city_color, "us_city_color.csv")
write.csv(us_travel, "us_travel.csv")
for(i in 1:length(us_city)){
  par(mar=c(1,1,1,1))
  map <- map("state", col="mintcream", fill=TRUE, bg="white", lwd=0.05)
  points(us_city_color$yc_longitude,
         us_city_color$yc_latitude,
         pch=16,cex=1.5, 
         col = as.character(us_city_color$us_city_color))
  dest_city <- filter(us_travel, business_city == us_city[i])
  plt <- colorRampPalette(c("#f2f2f2", as.character(us_city_color$us_city_color[i])))
  colors <- plt(max(dest_city$pop_num))
  for (j in 1:nrow(dest_city)) {
    inter <- gcIntermediate(c(dest_city$uc_lon[j], dest_city$uc_lat[j]), 
                            c(dest_city$bc_lon[j], dest_city$bc_lat[j]), 
                            n=100, addStartEnd=TRUE)
    c_index <- max(1, round(dest_city$pop_num[j]))  #clamp the color index to at least 1
    lines(inter, col=colors[c_index], lwd=2)
  }
  text(us_city_color$yc_longitude,
       us_city_color$yc_latitude,
       us_city_color$us_city,
       cex=0.5,adj=0,pos=1,col="black")
  title(main = paste("Map of Yelp Users Traveling to", us_city[i]), cex.main = 1, line = 1)
  }

(Maps shown, in order, for Las Vegas, Phoenix, Urbana-Champaign, Madison, Charlotte, and Pittsburgh.)

There are a lot of people traveling from Phoenix to almost all of the other cities. Most of the other travel patterns seem related to the physical proximity of the cities.
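One way to check the proximity claim is to compute great-circle distances between city pairs from the coordinate table built earlier; a sketch using geosphere (already loaded for the maps), with coordinates copied from `city_longlat`:

```r
library(geosphere)
# distHaversine() expects points as c(longitude, latitude)
phoenix   <- c(-112.0740, 33.4484)
las_vegas <- c(-115.1398, 36.1699)
distHaversine(phoenix, las_vegas) / 1000  # great-circle distance in km (about 410)
```

Applying this over all pairs of cities would give a distance matrix to compare against the travel counts.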

Creating Dataset for locations visited by tourists vs. locals

Now we look at all the cities in our dataset and try to examine whether there are areas in the city that are preferred by tourists versus locals.

all_map <- users_trstvsloc %>%
  left_join(reviews) %>%
  inner_join(business_full, by = "business_id")
## Joining by: "user_id"
colnames(all_map)[2] <- "user_city"
colnames(all_map)[5] <- "review_stars"
colnames(all_map)[18] <- "business_stars"
colnames(all_map)[22] <- "business_city"
all_map2 <- select(all_map, user_id, user_city, review_id:business_id, review_count, longitude, 
                      latitude, state, business_stars, business_city)
rm(all_map)
#create indicator variable for tourists vs. locals (a tourist reviews a business outside their home city)
all_map2 <- all_map2 %>% mutate(tourist = (user_city != business_city)) %>% 
  mutate(longitude = as.numeric(longitude), latitude = as.numeric(latitude))
#get average ratings by tourists vs. locals
all_map3 <- all_map2 %>%
  group_by(business_id, tourist) %>%
  summarise(tourist_stars = mean(review_stars), tourist_review_count = n()) 
#create color coding and labels for the reviewer (locals vs. tourists)
all_map4 <- full_join(all_map2, all_map3) %>% 
  select(business_id, review_count, longitude, latitude, state, 
         business_stars, business_city, tourist, tourist_stars) %>% 
  unique %>%
  mutate(cr = ifelse(tourist, "blue", "yellow"),
         reviewer = ifelse(tourist, "tourist", "local")) %>% na.omit
## Joining by: c("business_id", "tourist")
Mapping locations dined by Tourists vs Locals in each City
world_city <- c("Pittsburgh","Charlotte", "Madison", "Urbana-Champaign", "Phoenix", "Las Vegas","Montreal", "Waterloo", "Edinburgh","Karlsruhe")
#for each city map the locations reviewed by tourists vs locals, blue tourists, yellow locals
for(i in 1:length(world_city)){
  dest_city <- all_map4 %>% filter(business_city == world_city[i])
  new_map <-leaflet(data = dest_city) %>% 
  addTiles(urlTemplate = "http://{s}.tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png", 
           attribution = '&copy; <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>') %>%
  addCircles(~longitude, ~latitude, color = ~cr, opacity = 1)
  print(new_map)
}

We cannot detect specific areas of the cities that tourists might prefer. In every city, the downtown areas seem popular with both groups of users.

Mapping Tourist Effect in each City

We now visualize tourist-preferred businesses in a different way. Businesses rated higher by tourists than their overall average are shown in green, and businesses rated lower by tourists than their overall average are shown in red.

dev.off()
## null device 
##           1
business_tourist_effect <- read_csv("business_w_tourist_effect.csv")
all_map8 <- left_join(business_tourist_effect, 
                      unique(select(all_map2, business_id, longitude, latitude, business_city)))
## Joining by: "business_id"
all_map8 <- all_map8 %>% mutate(cr = ifelse(tourist_i <= 0, "red", "green"))
for(i in 1:length(world_city)){
  dest_city <- all_map8 %>% filter(business_city == world_city[i])
  new_map <- leaflet(data = dest_city) %>% 
  addTiles(urlTemplate = "http://{s}.tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png", 
           attribution = '&copy; <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>') %>%
  addCircles(~longitude, ~latitude, color = ~cr, opacity = 1)
  print(new_map)
}

Again, we cannot detect specific areas of the cities tourists might prefer.

Mapping Number of Tourists Dining at Each Business by City

In these maps, the color gradient of the circles represents the number of tourists dining at each business: the darker the circle, the greater the number of tourists.

all_map9 <- left_join(all_map3, select(all_map4, business_id, longitude, latitude, business_city), 
                      by = "business_id") 
all_map9 <- all_map9 %>% filter(tourist == TRUE & tourist_review_count > 10) %>% unique
plt2 <- colorRampPalette(c("#f2f2f2", "red"))
#index on a sqrt scale to compress the heavy right tail of review counts
colors2 <- plt2(ceiling(sqrt(max(all_map9$tourist_review_count))))
c_index <- ceiling(sqrt(all_map9$tourist_review_count))
colors3 <- colors2[c_index]
all_map9 <- cbind(all_map9, colors3)

for(i in 1:length(world_city)){
  dest_city <- all_map9 %>% filter(business_city == world_city[i])
  new_map <- leaflet(data = dest_city) %>% 
  addTiles(urlTemplate = "http://{s}.tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png", 
           attribution = '&copy; <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>') %>%
  addCircles(~longitude, ~latitude, color = ~colors3, opacity = 1)
  print(new_map)
}
Comparing Average Local vs Tourist Rating for Top 25 Food Categories

We now examine the ratings by tourists vs. locals in the top 25 food categories for each city.

#filter only food businesses
business_food <- business_full %>%
  filter(grepl("Restaurants", as.character(categories)) | 
           grepl("Cafe", as.character(categories)))
#create dataset with indicator variables for each food category as before
business_food_cat <- cbind(unlist(rep(business_food$business_id, lengths(business_food$categories))),
                           unlist(business_food$categories))
business_food_cat <- data.frame(business_food_cat)
colnames(business_food_cat) <- c("business_id", "category")
all_map5 <- inner_join(all_map4, business_food_cat)
## Joining by: "business_id"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
#plot
for(i in 1:length(world_city)){
  dest_city <- filter(all_map5, business_city == world_city[i])
  cats <- dest_city %>% 
    group_by(category) %>% 
    summarise(n = n()) %>%
    arrange(desc(n))
  print(dest_city %>%
    filter(category %in% cats$category[1:25]) %>%
    group_by(category, tourist) %>%
    summarise(stars = mean(tourist_stars), se = 1.96*sd(tourist_stars)/sqrt(n())) %>%
    ggplot(aes(x = category, y = stars, group = tourist)) +
    geom_point(aes(color = tourist)) +
    geom_errorbar(aes(ymax = stars + se, ymin=stars - se), width=0.2) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(x = "Food Category", y = "Average Stars", 
         title = paste("Locals vs. Tourist ratings in", world_city[i], "in Top 25 Categories")))
}

## Warning: Removed 1 rows containing missing values (geom_errorbar).

There are some interesting patterns in how the different categories are rated across the cities, but the local-tourist differences are mostly not meaningful since the confidence intervals overlap for most data points.

Comparing Local vs Tourist Ratings for Each Business by City

We now examine tourist vs. local preferences at a finer level by plotting the average tourist vs. local rating for each business.

all_map6 <- all_map4 %>% filter(business_id %in% unique(business_food$business_id))
all_map6 <- all_map6 %>% group_by(business_id) %>% filter(n() == 2)
all_map6 <- all_map6 %>% mutate(tourist = ifelse(tourist, 1, 0))
all_map6 <- all_map6 %>% select(business_id, business_city, tourist, tourist_stars)
all_map7 <- all_map6 %>% spread(tourist, tourist_stars, convert = TRUE)
colnames(all_map7)[3] <- "avg_local_star"
colnames(all_map7)[4] <- "avg_trst_star"

for(i in 1:length(world_city)){
  dest_city <- filter(all_map7, business_city == world_city[i])
  print(dest_city %>%
    ggplot(aes(x = avg_local_star, y=avg_trst_star)) +
    geom_point(colour = "seagreen", alpha = 0.5) +
    geom_abline(intercept = 0, slope = 1, color="red", linetype="dashed", size=0.5) +
    labs(x = "Average Local Star", y = "Average Tourist Star", 
         title = paste("Locals vs. Tourist ratings by business in", world_city[i])))
}

Local and tourist ratings for a business are clearly correlated: businesses rated higher by locals are also rated higher by tourists. For businesses rated low by locals, tourists seem to be more generous in general; the reverse, tourists rating low where locals rate high, is less common.
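The visual impression of correlation could be quantified directly; a sketch (output omitted, since it depends on the underlying data):

```r
# Per-city correlation between average local and average tourist ratings
all_map7 %>%
  group_by(business_city) %>%
  summarise(r = cor(avg_local_star, avg_trst_star, use = "complete.obs"))
```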